Etienne Ackermann, 07/02/2015
This series of notebooks will go through the steps of analyzing some data from the CRCNS hc-3 data set, as well as generating some of the figures to be used on my SfN poster.
The data can be downloaded from the CRCNS (Collaborative Research in Computational Neuroscience) website; see the hc-3
data set page in particular.
Here I will use a modified version of the Python module YAHMM (yet another hidden Markov model) by Jacob Schreiber.
The (rough) workflow of my YAHMM experiment is as follows:
- extract spike trains from the raw data and bin them into spike count DataFrames (df_spk_counts)
- split df_spk_counts into training and validation sets (a sketch of this step appears at the end of this notebook)
- use df_spk_counts and sequence summary info to build sequences of observations suitable for YAHMM; store these in training_sequences and validation_sequences, respectively
From the data set description page:
The data set contains recordings made from multiple hippocampal areas in Long-Evans rats, including: Cornu Ammonis: CA1 and CA3; dentate gyrus (DG), and entorhinal cortex: EC2, EC3, EC4, EC5. The data was obtained during 442 recording sessions and includes responses of 7,737 neurons in eleven animals while they performed one of fourteen behaviors. Total time for all experiments is 204.5 hours. The raw (broadband) data was recorded at 20 kHz, simultaneously from 31 to 127 channels. The raw data was processed to extract the LFPs (Local Field Potentials) and detect spikes. Included in the data set are the following items:
- Times and waveforms of detected potential spikes.
- Results of spike sorting.
- LFPs.
- For many sessions, the coordinate and direction of the rat's head and video files from which the positions were extracted.
- Metadata tables giving properties of the neurons and characteristics of the recording sessions.
I will start by considering two of the animals, gor01 and vvp01. In particular, they have the following numbers of recorded sessions:
Task | gor01 | vvp01 | Description |
---|---|---|---|
linear one | 3 | 5 | linear track (250 cm?) |
linear two | 3 | 5 | linear track (250 cm?) |
open | 3 | | open field |
sleep | 1 | | sleeping |
T maze | 2 | 1 | |
wheel | 1 | | operant wheel running task; see Mizuseki et al., 2009 |
In this way, I might be able to distinguish between different environments using my HMM implementation. There are (perhaps) easy cases, such as linear vs wheel, and perhaps less easy cases, such as linear one vs linear two.
Ideally, I would like to see the HMM distinguish even the less easy cases, although this remains to be seen...
Also note that spike sorting occurs on a daily basis, so that sorted units are only meaningful within a small number of recording sessions.
Note that both gor01 and vvp01 were recorded with a Neuralynx recording system: signals were amplified 1000x, band-pass filtered (1 Hz–5 kHz), and sampled at 32,552 Hz.
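Since the spike times in the .res files below are given in sample numbers, this 32,552 Hz sampling rate is what converts them to seconds, e.g.:
In [ ]:
# convert a spike time from sample number to seconds (the sample value is just an example)
fs = 32552            # Neuralynx sampling rate in Hz
print(162760.0 / fs)  # 5.0 seconds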
In [1]:
import sys
sys.path.insert(0, '../')
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
#import yahmm as ym
from matplotlib import pyplot as plt
from pandas import Series, DataFrame
from efunctions import * # load my helper functions
%matplotlib inline
In [2]:
import os
def get_sessions_for_animal(a_dir, animal):
    # list all session directories for the given animal (names start with the animal ID)
    return [name for name in os.listdir(a_dir)
            if os.path.isdir(os.path.join(a_dir, name)) and name.startswith(animal)]

def get_trials_for_session(SessionInfo, session=0):
    # store the file prefix (path + base name) of the 'lin1' and 'lin2' trials of a session
    mypath = SessionInfo['datadir'] + '/' + SessionInfo['sessions'][session]
    SessionInfo['LinOne'] = [mypath + '/' + name + '/' + name[:-5] for name in os.listdir(mypath)
                             if os.path.isdir(os.path.join(mypath, name)) and name[-4:] == 'lin1']
    SessionInfo['LinTwo'] = [mypath + '/' + name + '/' + name[:-5] for name in os.listdir(mypath)
                             if os.path.isdir(os.path.join(mypath, name)) and name[-4:] == 'lin2']
SessionInfo=dict()
SessionInfo['datadir'] = '/home/etienne/Dropbox/neoReader/Data'
#SessionInfo['datadir'] = 'C:/Etienne/Dropbox/neoReader/Data'
SessionInfo['animal'] = 'gor01'
SessionInfo['sessions'] = get_sessions_for_animal(SessionInfo['datadir'],SessionInfo['animal'])
get_trials_for_session(SessionInfo, session=0)
SessionInfo
Out[2]:
In [5]:
# specify session parameters
num_electrodes = 12
for ele in np.arange(num_electrodes):
    # .clu files contain cluster (unit) IDs, one spike per line (first line = number of clusters);
    # .res files contain the corresponding spike times, in sample numbers
    dt1a = pd.read_table( SessionInfo['LinOne'][0] + '.clu.' + str(ele + 1), skiprows=1, names=['u'] )
    dt1b = pd.read_table( SessionInfo['LinOne'][0] + '.res.' + str(ele + 1), header=None, names=['t'] )
    dt2a = pd.read_table( SessionInfo['LinTwo'][0] + '.clu.' + str(ele + 1), skiprows=1, names=['u'] )
    dt2b = pd.read_table( SessionInfo['LinTwo'][0] + '.res.' + str(ele + 1), header=None, names=['t'] )

    ls1a = list(dt1a['u'])
    ls1b = list(dt1b['t'])
    ls2a = list(dt2a['u'])
    ls2b = list(dt2b['t'])

    # one column per electrode: cluster IDs indexed by spike time (sample number)
    d1 = {'clu' + str( ele + 1 ): Series(ls1a, index=ls1b)}
    d2 = {'clu' + str( ele + 1 ): Series(ls2a, index=ls2b)}

    if ele == 0:
        df1 = DataFrame(d1)
        df2 = DataFrame(d2)
    else:
        df1 = df1.append(DataFrame(d1))
        df2 = df2.append(DataFrame(d2))
print( 'The number of spikes recorded on each electrode is as follows:' )
print( 'For linear one:' )
print( df1.count() )
print( 'For linear two:' )
print( df2.count() )
print( 'The total number of spikes detected (lin1, lin2):' + str(sum(df1.count())) + ', ' + str(sum(df2.count())))
However, we now need to transform the data so that we can view the spikes by sorted units, and not on which electrodes they were recorded.
First, we know that units 0 and 1 should be omitted from any analyses, since these correspond to mechanical noise, and small, unsortable spikes, respectively.
So let's remove all rows of our DataFrames with spikes corresponding only to clusters 0 or 1:
In [6]:
# drop all rows whose spikes belong only to cluster 0 (mechanical noise) or cluster 1 (small, unsortable spikes)
for cidx in [0, 1]:
    idx1 = list(df1[df1==cidx].dropna(how='all').index)
    idx2 = list(df2[df2==cidx].dropna(how='all').index)
    df1 = df1.drop(idx1)
    df2 = df2.drop(idx2)
print( 'The number of spikes recorded on each electrode (after removing spikes corresponding to clusters 0 and 1) is as follows:' )
print( 'For linear one:' )
print( df1.count() )
print( 'For linear two:' )
print( df2.count() )
print( 'The total number of spikes detected (lin1, lin2):' + str(sum(df1.count())) + ', ' + str(sum(df2.count())))
This clean-up operation has significantly reduced the size of our DataFrames, so extracting spike trains should now be a little faster, or at least require less memory.
Note that this clean-up is not perfect. In particular, the count operation counts only the numeric entries per column and ignores the NaNs, unless the entire column is filled with NaNs, in which case it counts the NaNs. This is the case for df2.clu5, for example, where we have 5583 NaNs after clean-up and no actual usable spikes.
We can now extract spike trains in much the same way as the clean-up operation: we find all the rows in which any column contains the unit number that we want, and we note the corresponding timestamps. This gives us the spike train for that unit, although the timestamps have not been sorted yet.
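As a quick illustration, the (unsorted) spike train of a single unit, say unit 4, could be pulled out like this (a minimal sketch; the full extraction over all units is done further below):
In [ ]:
# minimal sketch: sample numbers of all spikes attributed to unit 4
st_u4 = list(df1[df1==4].dropna(how='all').index)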
Let's do this now. But first, as an example, consider the following:
In [7]:
# identify columns (shanks) that have zero spikes left in either session...
zero_spk_cols = [ x|y for (x,y) in zip((df1.count()==0).tolist(), (df2.count()==0).tolist())]
zspi = [i for i, elem in enumerate(zero_spk_cols) if elem]
# ...and drop them from both DataFrames, so that the two sessions keep identical columns
df1.drop(df1.columns[zspi], axis=1, inplace=True) # WARNING! df1 and df2 must have the same column order!
df2.drop(df2.columns[zspi], axis=1, inplace=True) # WARNING! df1 and df2 must have the same column order!
In [8]:
df1.sort_index().head()
Out[8]:
Here we have called sort_index() (which, by the way, does not change the order of df1 in place), and from the output we can see that the first three spikes have been attributed to unit 4, and the next two spikes to unit 14.
Also, since we would like to ultimately sort the spike trains by timestamp for all of the units, we may as well sort the entire DataFrame once, and then extract the (already-sorted) spike trains from there.
In [9]:
df1 = df1.sort_index()
df2 = df2.sort_index()
In [10]:
def rename_units(dfextern):
    # renumber units so that unit labels are unique (and contiguous) across shanks/columns
    df = dfextern.copy()
    prevmax = 0
    num_shanks = len(df.columns)
    for ss in np.arange(0,num_shanks):
        col = df.columns[ss]
        # after removing clusters 0 and 1, each shank's unit labels start at 2;
        # shift them so that numbering continues from the previous shank's maximum
        if df[col].min() == 2:
            notnull = df[col].notnull()
            df.loc[notnull, col] = df.loc[notnull, col] + prevmax - 1
        prevmax = df[col].max()
    return df
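To see what rename_units does, consider a toy example (hypothetical data, not from the recordings): if shank 1 has units {2, 3, 4} and shank 2 has units {2, 3}, then after renaming shank 1 holds units {1, 2, 3} and shank 2 holds units {4, 5}, i.e. every unit gets a globally unique label.
In [ ]:
# toy example of rename_units (hypothetical data)
toy = DataFrame({'clu1': Series([2, 3, 4], index=[10, 20, 30]),
                 'clu2': Series([2, 3], index=[15, 25])})
rename_units(toy).max()   # expected: clu1 -> 3, clu2 -> 5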
In [11]:
df1rn = rename_units(df1)
df2rn = rename_units(df2)
In [12]:
df1rn.max()
Out[12]:
Now we can request the spike trains:
In [13]:
num_units1 = int(max( df1rn.max() ))
num_units2 = int(max( df2rn.max() ))

spk_counts1 = {}
spk_counts2 = {}
st_list1 = []
st_list2 = []

# note: this assumes num_units1 >= num_units2, so that no units of df2rn are skipped
for uu in np.arange(1,num_units1+1):
    # all timestamps (already sorted) at which unit uu fired
    st1 = list((((df1rn[df1rn==uu])).dropna(how='all')).index)
    spk_counts1['u' + str(uu)] = len(st1)
    st2 = list((((df2rn[df2rn==uu])).dropna(how='all')).index)
    spk_counts2['u' + str(uu)] = len(st2)
    st_list1.append(st1)
    st_list2.append(st2)

# convert list of lists to list of ndarrays:
st_array1rn = [np.array(a) for a in st_list1]
st_array2rn = [np.array(a) for a in st_list2]
In [14]:
import pickle

with open('../../Data/st_array1rn126.pickle', 'wb') as f:
    pickle.dump(st_array1rn, f)
with open('../../Data/st_array2rn126.pickle', 'wb') as f:
    pickle.dump(st_array2rn, f)
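These arrays can later be loaded back with pickle.load (same file names as above):
In [ ]:
# reload the pickled spike time arrays
with open('../../Data/st_array1rn126.pickle', 'rb') as f:
    st_array1rn = pickle.load(f)
with open('../../Data/st_array2rn126.pickle', 'rb') as f:
    st_array2rn = pickle.load(f)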
In [15]:
def list_of_spk_time_arrays_to_spk_counts_arrays(st_array_extern, ds=0, fs=0 ):
    """
    st_array: list of ndarrays containing spike times (in sample numbers!)
    ds: delta sample number; integer value of samples per time bin
    fs: sampling frequency
    argument logic: if only st_array is passed, use default ds; if ds is passed, use as is and ignore fs;
    if both ds and fs are passed, interpret ds as a time interval in seconds and convert to samples
    returns a (num_bins x num_units) DataFrame with spike counts
    """
    st_array = st_array_extern

    if fs == 0:
        if ds == 0:
            ds = 1000 # assume default interval size (in samples)
    else: # sampling frequency was passed, so interpret ds as a time interval and convert to samples:
        if ds == 0:
            ds = 1000 # assume default interval size (in samples)
        else:
            ds = round(ds*fs)

    # determine number of units:
    num_units = len(st_array)

    # find the largest spike time over all units (units with no spikes are skipped):
    maxtime = 0
    for uu in np.arange(num_units):
        try:
            maxtime = max(st_array[uu].max(), maxtime)
        except ValueError: # empty spike train
            maxtime = maxtime

    # create list of bin (left) edges:
    intlist = np.arange(0,maxtime,ds)
    num_bins = len(intlist)

    spks_bin = np.zeros((num_bins,num_units))
    print("iterating over {0} intervals...".format(num_bins))

    for uu in np.arange(num_units):
        # count the number of spikes of unit uu in each interval:
        spks_bin[:,uu] = np.histogram(st_array[uu], bins=num_bins, density=False, range=(0,maxtime))[0]

    return pd.DataFrame(spks_bin)
df_spk_counts1 = list_of_spk_time_arrays_to_spk_counts_arrays(st_array1rn, ds=0.25, fs=32552)
df_spk_counts2 = list_of_spk_time_arrays_to_spk_counts_arrays(st_array2rn, ds=0.25, fs=32552)
#df_spk_counts2 = list_of_spk_time_arrays_to_spk_counts_df(st_array2, ds=500000)
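As a quick sanity check on the parameters above: since both ds and fs were passed, ds=0.25 is interpreted as a time interval of 250 ms, which corresponds to round(0.25 x 32552) = 8138 samples per bin.
In [ ]:
# bin width in samples for 250 ms bins at the Neuralynx sampling rate
print(round(0.25 * 32552))   # 8138 samples per bin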
In [16]:
df_spk_counts1.head()
Out[16]:
In [17]:
df_spk_counts2.head()
Out[17]:
In [108]:
#import pickle
# WARNING!!! Pickling does NOT preserve meta-data as per below...
#df_spk_counts1._description = 'gor01-6-7/2006-6-7_11-26-53_lin1'
#df_spk_counts1._timebin = '250 ms'
#df_spk_counts2._description = 'gor01-6-7/2006-6-7_16-40-19_lin2'
#df_spk_counts2._timebin = '250 ms'
#with open('../../Data/df_spk_counts1_250ms_rn.pickle', 'wb') as f:
# pickle.dump(df_spk_counts1, f)
#
#with open('../../Data/df_spk_counts2_250ms_rn.pickle', 'wb') as f:
# pickle.dump(df_spk_counts2, f)
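Finally, as mentioned in the workflow at the top, df_spk_counts1 (and df_spk_counts2) still need to be split into training and validation sets and converted into lists of observation sequences for YAHMM. Below is a minimal sketch of what that split might look like; the 50/50 split, the fixed sequence length of 10 bins, and everything beyond the names training_sequences and validation_sequences mentioned in the workflow are assumptions, not the final implementation.
In [ ]:
# hedged sketch (not the final implementation): split the binned spike counts into
# training and validation sequences for YAHMM. The 50/50 split and the fixed
# sequence length of 10 bins (2.5 s at 250 ms per bin) are assumptions.
seq_len = 10                                    # bins per observation sequence
obs = df_spk_counts1.values                     # (num_bins x num_units) spike counts
num_seqs = len(obs) // seq_len
sequences = [obs[ii*seq_len:(ii + 1)*seq_len] for ii in np.arange(num_seqs)]

split = num_seqs // 2                           # assumed 50/50 split
training_sequences = sequences[:split]
validation_sequences = sequences[split:]
print(len(training_sequences), len(validation_sequences))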